Add C3ID implementation #21

nguerrera · 2025-01-08T16:01:25Z

This implements C3ID computation that we will need soon.

The implementation differs from the one in https://github.com/microsoft/security-utilities and this seeks to become the new standard.

Changes:

There is no use of "Microsoft"
There is no byte array to hex conversion in the intermediate computation
Optimized to avoid allocations
Raw value is truncated to 12 bytes instead of 15
"C3ID" is prefixed to the canonical representation in base64

Tests include a naive, unoptimized implementation that is easier to understand, and we check that both implementations match. We also check that we produce the same deterministic value for a sampling of hard-coded test input.

This change also fixes edge cases with empty span input in encoding polyfills.

This ports C3ID computation (that we will need soon) from https://github.com/microsoft/security-utilities Changes: - Company name is a parameter and not hard-coded to Microsoft - Optimized like the rest of Cask to avoid allocations and write the result to a span Tests added here were also run against the implementation in security-utilities to check equivalence. Also: fix edge cases with empty span input in encoding polyfills.

nguerrera · 2025-01-08T16:41:12Z

#22 should fix the build issue. Let's get that reviewed and merged first.

src/Cask/CrossCompanyCorrelatingId.cs

michaelcfanning · 2025-01-09T17:22:43Z

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

+    [InlineData("😁", "f/BTV0j6A8km4KDw7aJz")]
+    public void Test_Basic(string text, string expected)
+    {
+        string actual = ComputeC3IDBase64(company: "Microsoft", text);


Hey is the proper rendering C3ID or C3id?

Haha. It is a true fact that I treat this as a valid question despite the obvious triviality of the concern.

I think I prefer C3id but need I mention I have no desire to die on this hill? :)

Tehe. I think I prefer C3Id or C3ID over C3id. Id is a word so if the convention in C# is CreateUrl, C3Id would be the most consistent.

It does not appear in any identifier other than in tests. As such, my care factor is vanishingly small. I'm just as picky about these things in public API, so no judgment. 😁 I'll happily change it, but I still don't know which to use...

I went with C3Id now. I need it in non-test code in my next change. Still not public, but I think it's worth settling this right now and not having noise of renaming it mixed in with other changes.

My reason for C3Id over C3id is a nod to my much younger self that rewrote the FxCop casing tokenizer ~20 years ago. That would parse this as two tokens C3 and id. And only C3Id respects the guidelines with that tokenization. 😁

I'd ask to get rid of the number in the middle of an acronym, but I think I've caused enough trouble asking for algo changes.

src/Cask/CrossCompanyCorrelatingId.cs

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

src/Cask/CrossCompanyCorrelatingId.cs

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

- Use base64-decoded prefix for hash input and base64-encode input - Throw on empty or whitespace

src/Tests/Cask.Tests/CrossCompanyCorrelatingIdTests.cs

nguerrera · 2025-01-15T16:33:16Z

src/Cask/CrossCompanyCorrelatingId.cs

+        Prefix.CopyTo(input);
+        SHA256.HashData(textUtf8, input[Prefix.Length..]);
+
+        // Perform second hash, truncate, and copy to destination.


I'm getting concerned about the cost here. Two rounds of SHA256 is a lot of compute for a 12 byte hash.

For context from offline conversation: the rationale (as I understand it, but I may not) for two rounds in C3ID is to make it possible to compute the C3ID from the plain SHA256. Without sharing the key and without adding C3ID to their toolkit, someone can send you the SHA256 of a key, and then you can convert that to C3ID.

We should maybe reconsider if this scenario is worth the cost added to CASK scenarios. I'll provide some data on that cost in a minute.

Without C3ID in hash scenarios

From #18 (comment)

Method Mean Error StdDev Gen0 Allocated

CompareHash_Cask 451.609 ns 3.9292 ns 3.6753 ns - -

CompareHash_Floor 294.875 ns 2.7652 ns 2.5866 ns - -

GenerateHash_Cask 300.978 ns 1.3669 ns 1.2786 ns 0.0091 176 B

GenerateHash_Floor 225.569 ns 1.6476 ns 1.2864 ns 0.0057 112 B

With C3ID in hash scenarios

From my work-in-progress branch:

Method Mean Error StdDev Gen0 Allocated

CompareHash_Cask 856.6 ns 6.06 ns 5.37 ns - -

CompareHash_Floor 298.5 ns 3.69 ns 3.27 ns - -

GenerateHash_Cask 725.1 ns 5.20 ns 4.86 ns 0.0114 216 B

GenerateHash_Floor 226.3 ns 0.78 ns 0.69 ns 0.0057 112 B

There are some other changes in this branch so this doesn't quite isolate the C3ID cost. However, I also measured on this machine that a single SHA256 of Cask key sized data -- without allocation, copying, or encoding conversion -- takes about 120ns on this machine. That's a lot relative to the "floor" scenarios we're aiming to replace.

Up to now, I was confident in being able to tell a good story about how the extra Cask costs were a fair price to pay for Cask features, but I am not sure C3ID is living up to that now.

With that said, I think it would be good to take this change and discuss options to improve this in follow-up.

Ack on this and let's talk more.

michaelcfanning

nguerrera requested review from michaelcfanning, rwoll and suvamM as code owners January 8, 2025 16:01

nguerrera force-pushed the c3id branch from de47666 to a5c7249 Compare January 8, 2025 16:06

nguerrera closed this Jan 9, 2025

nguerrera reopened this Jan 9, 2025

michaelcfanning reviewed Jan 9, 2025

View reviewed changes

src/Cask/CrossCompanyCorrelatingId.cs Outdated Show resolved Hide resolved